[WIP] Add pause resume sandbox proposal #39
Conversation
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: … The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing the approval command in a comment.
Summary of Changes

Hello @hzxuzhonghu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to AgentCube Sandbox management by implementing an idle-aware pause and resume lifecycle. Instead of outright deleting idle sandbox pods, the system will now leverage the containerd API to pause their containers, thereby preserving user state and significantly reducing resource consumption. This change aims to improve developer experience by offering near-instantaneous sandbox resumption. The proposal thoroughly outlines the technical design, including how containers are paused and resumed, the mechanisms for triggering a resume (via annotations or NATS JetStream), and the necessary API and CRD modifications to support this new functionality.

Highlights
Code Review
This pull request introduces a comprehensive design proposal for a sandbox pause/resume feature. The document is well-structured and covers motivation, design details, API changes, and implementation plans in great detail. My review focuses on improving the clarity and consistency of the proposal. I've suggested a change to the control flow diagram to better represent the reconciliation logic and pointed out a minor inconsistency in timestamp formatting.
```mermaid
flowchart TD
    Start["Reconcile Sandbox"] --> CheckIdle{Idle timeout exceeded?}
    CheckIdle -- "No" --> Exit["Requeue/Exit"]
    CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
    VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
    VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
    PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
    UpdateState --> EmitPaused["Emit SandboxPaused event"]
    EmitPaused --> Exit
    Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
    WaitResume -- "Annotation" --> HandleAnnotation["Process resume annotation"]
    WaitResume -- "Message queue" --> HandleMQ["Process MQ resume message"]
    HandleAnnotation --> ResumeContainers
    HandleMQ --> ResumeContainers
    ResumeContainers["Resume containers via containerd"] --> RefreshActivity["Refresh last-activity timestamp"]
    RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
    UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
    EmitResumed --> Exit
```
The control flow diagram is a bit confusing as it shows two separate Start nodes for what should be a single reconciliation loop. A reconciler typically has one entry point and then branches based on the state of the resource. I suggest restructuring the diagram to reflect a single, unified flow, which will improve the clarity of the proposed design. The suggested diagram below first checks for a resume signal and then for idleness, which is a more common pattern in controller logic.
Suggested change (replacing the diagram above with a single unified flow):

```mermaid
flowchart TD
    Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
    WaitResume -- "Yes (Annotation or MQ)" --> ResumeContainers["Resume containers via containerd"]
    ResumeContainers --> RefreshActivity["Refresh last-activity timestamp"]
    RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
    UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
    EmitResumed --> Exit["Requeue/Exit"]
    WaitResume -- "No" --> CheckIdle{Idle timeout exceeded?}
    CheckIdle -- "No" --> Exit
    CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
    VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
    VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
    PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
    UpdateState --> EmitPaused["Emit SandboxPaused event"]
    EmitPaused --> Exit
```
Pull Request Overview
This PR introduces a comprehensive design proposal for implementing pause/resume functionality for AgentCube sandbox pods. Instead of deleting idle sandbox pods, the proposal advocates for pausing containers via containerd's native pause/resume API, which preserves user state while freeing CPU resources. The design includes both annotation-based and NATS JetStream message queue-based resume triggers, comprehensive observability, and a detailed implementation plan.
Key Changes:
- Design document for sandbox pod lifecycle management with pause/resume capabilities via containerd
- Introduction of NATS JetStream as the message queue for asynchronous resume operations
- New annotation-based API for tracking sandbox state and triggering resume operations
> 1. Establish a containerd client:
>    - `client, err := containerd.New("/run/containerd/containerd.sock")`.
>    - Use a short timeout context (e.g. 5–10 seconds) to avoid blocking reconciles.
Copilot (AI), Nov 21, 2025:
The timeout value of "5–10 seconds" for containerd operations may be insufficient for pause/resume operations under heavy node load. When a node is under memory pressure or high I/O load, containerd operations can take longer than 10 seconds. Additionally, if pausing/resuming multiple containers in sequence (as described in line 90 "For each workload container"), the total time could exceed this timeout.
Consider:
- Using a per-operation timeout (e.g., 30 seconds per container) rather than a global timeout for all containers
- Making the timeout configurable
- Documenting the expected time for pause/resume operations under normal and stressed conditions
- Implementing proper cancellation handling if a context timeout occurs mid-operation
Without adequate timeouts, the reconciliation loop could block, preventing other sandboxes from being processed.
Suggested change (replacing the "short timeout context" bullet):

> - For each container operation, use a per-operation timeout context (e.g., 30 seconds per container). The timeout should be configurable to accommodate different node loads and operational environments.
> - Document the expected time for pause/resume operations under both normal and stressed conditions, and implement proper cancellation handling if a context timeout occurs mid-operation.
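To make the suggestion concrete, here is a minimal sketch, assuming the standard containerd Go client and the CRI `k8s.io` namespace; `pauseAll` and `perOpTimeout` are illustrative names standing in for a configurable controller setting, not existing code:

```go
package lifecycle

import (
	"context"
	"fmt"
	"time"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

// pauseAll pauses each workload container of a sandbox pod, giving every
// containerd call its own timeout so one slow operation cannot block the
// whole reconcile. perOpTimeout is assumed to come from controller config.
func pauseAll(socket string, containerIDs []string, perOpTimeout time.Duration) error {
	client, err := containerd.New(socket)
	if err != nil {
		return fmt.Errorf("connect to containerd: %w", err)
	}
	defer client.Close()

	// CRI-managed containers live in the "k8s.io" containerd namespace.
	base := namespaces.WithNamespace(context.Background(), "k8s.io")

	for _, id := range containerIDs {
		ctx, cancel := context.WithTimeout(base, perOpTimeout)
		err := func() error {
			defer cancel()
			container, err := client.LoadContainer(ctx, id)
			if err != nil {
				return fmt.Errorf("load container %s: %w", id, err)
			}
			task, err := container.Task(ctx, nil)
			if err != nil {
				return fmt.Errorf("load task %s: %w", id, err)
			}
			return task.Pause(ctx)
		}()
		if err != nil {
			// A deadline exceeded here may leave the container unpaused; the
			// next reconcile re-checks task status and retries.
			return err
		}
	}
	return nil
}
```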
> - New annotations:
>   - `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
>   - `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
>   - `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
Copilot (AI), Nov 21, 2025:
The proposed annotation keys use domain-qualified names (e.g., sandbox.lifecycle.volcano.sh/state, sandbox.lifecycle.volcano.sh/resume-requested-at) which is inconsistent with existing annotation conventions in the codebase. The existing annotations use simple keys without domain prefixes: last-activity-time and creator-service-account (see pkg/apiserver/k8s_client.go:26,28).
For consistency, consider either:
- Using simple annotation keys like `sandbox-state`, `resume-requested-at`, `last-paused-at`, `last-resumed-at`
- Or, if domain-qualified names are preferred for these new lifecycle annotations, document why this departure from the existing pattern is necessary and consider migrating existing annotations to the same pattern
Mixing annotation styles can lead to confusion about which convention to follow for future additions.
> - New annotations:
>   - `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
>   - `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
>   - `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
Copilot (AI), Nov 21, 2025:
The annotation sandbox.lifecycle.volcano.sh/state storing values like "Running|Paused|Resuming|Error" should be clarified. Based on Kubernetes conventions, runtime state should typically be stored in status fields rather than annotations. Annotations are intended for metadata and non-authoritative information. Consider:
- Moving the state to a proper `status.phase` or `status.lifecycleState` field
- If keeping it as an annotation, explain why this design choice was made over using status fields
- Document how conflicts are resolved if both annotation and status contain state information
This is especially important given that line 172 mentions status.conditions entries, suggesting status is already being used for related information.
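If the state were moved into `status.conditions`, the standard condition helpers could be used. A minimal sketch follows; the `Paused` condition type and reason strings are assumptions for illustration, not existing API:

```go
package lifecycle

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPaused records the lifecycle state as a status condition rather than an
// annotation, so kubectl and other controllers can consume it as regular
// status. Condition type, reason, and message are illustrative.
func markPaused(conditions *[]metav1.Condition, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Paused",
		Status:             metav1.ConditionTrue,
		Reason:             "IdleTimeoutExceeded",
		Message:            "sandbox containers paused via containerd",
		ObservedGeneration: observedGeneration,
	})
}
```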
> - Use a short timeout context (e.g. 5–10 seconds) to avoid blocking reconciles.
>
> 2. For each workload container:
>    - Strip the `containerd://` prefix from `containerID`.
>    - Look up the corresponding task with `client.LoadTask(ctx, containerID)`.
Copilot (AI), Nov 21, 2025:
The API call `client.LoadTask(ctx, containerID)` is incorrect. The containerd Go client API requires loading a container first and then getting its task. The correct pattern is `container, err := client.LoadContainer(ctx, containerID)` followed by `task, err := container.Task(ctx, nil)`. The `LoadTask` method does not exist on the containerd client in this way.
Suggested change:

> - Look up the corresponding task:
>   - `container, err := client.LoadContainer(ctx, containerID)`
>   - `task, err := container.Task(ctx, nil)`
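For reference, a sketch of the same load pattern on the resume side, again assuming the standard containerd Go client; the status check illustrates the idempotent handling discussed later in the proposal, and `resumeOne` is a hypothetical helper name:

```go
package lifecycle

import (
	"context"
	"fmt"
	"strings"

	containerd "github.com/containerd/containerd"
)

// resumeOne loads a container by its CRI container ID and resumes its task
// only if it is currently paused, so repeated resume signals are safe.
func resumeOne(ctx context.Context, client *containerd.Client, containerID string) error {
	// Strip the CRI runtime prefix, e.g. "containerd://<id>".
	id := strings.TrimPrefix(containerID, "containerd://")

	container, err := client.LoadContainer(ctx, id)
	if err != nil {
		return fmt.Errorf("load container %s: %w", id, err)
	}
	task, err := container.Task(ctx, nil)
	if err != nil {
		return fmt.Errorf("load task for %s: %w", id, err)
	}
	status, err := task.Status(ctx)
	if err != nil {
		return fmt.Errorf("task status for %s: %w", id, err)
	}
	if status.Status != containerd.Paused {
		return nil // already running (or exited); nothing to do
	}
	return task.Resume(ctx)
}
```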
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.
> 3. If any container fails to pause:
>    - Record a `Paused` condition with `status=False` and a reason (e.g. `ContainerdPauseError`).
>    - Emit a Kubernetes event on the `Sandbox` and the pod.
>    - Optionally retry with backoff; give up after a small number of attempts to keep reconciliation bounded.
Copilot (AI), Nov 21, 2025:
The error handling description states that the controller should "give up after a small number of attempts to keep reconciliation bounded," but this conflicts with the idempotency design mentioned in lines 125-133. If the reconciler is supposed to be idempotent and converge to desired state, giving up on pause operations could leave the sandbox in an inconsistent state. Consider clarifying whether failed pause operations should transition to an error state and wait for manual intervention, or if they should continue retrying on subsequent reconciliation loops.
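One way to reconcile the two goals, sketched under the assumption of a controller-runtime reconciler (the type and field names are hypothetical): surface the failure as an event and return the error, so the per-item rate limiter bounds how hot the retry loop runs while repeated reconciles still converge.

```go
package lifecycle

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
)

// SandboxLifecycleReconciler is a hypothetical reconciler wrapper used only
// for this illustration.
type SandboxLifecycleReconciler struct {
	Recorder record.EventRecorder
}

// handlePauseError emits a warning event and returns the error instead of
// giving up permanently: controller-runtime then applies exponential backoff,
// and the next reconcile re-checks actual containerd task state before
// retrying, keeping the loop bounded yet convergent.
func (r *SandboxLifecycleReconciler) handlePauseError(sandbox runtime.Object, pauseErr error) (ctrl.Result, error) {
	r.Recorder.Eventf(sandbox, corev1.EventTypeWarning, "ContainerdPauseError",
		"failed to pause sandbox containers: %v", pauseErr)
	return ctrl.Result{}, pauseErr
}
```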
```mermaid
flowchart TD
    Start["Reconcile Sandbox"] --> CheckIdle{Idle timeout exceeded?}
    CheckIdle -- "No" --> Exit["Requeue/Exit"]
    CheckIdle -- "Yes" --> VerifyPod["Verify bound pod exists"]
    VerifyPod -- "Missing" --> Warn["Log warning & requeue"]
    VerifyPod -- "Found" --> PauseContainers["Pause pod containers via containerd"]
    PauseContainers --> UpdateState["Update annotations/status: state=Paused"]
    UpdateState --> EmitPaused["Emit SandboxPaused event"]
    EmitPaused --> Exit
    Start["Reconcile Sandbox"] --> WaitResume{Resume signal?}
    WaitResume -- "Annotation" --> HandleAnnotation["Process resume annotation"]
    WaitResume -- "Message queue" --> HandleMQ["Process MQ resume message"]
    HandleAnnotation --> ResumeContainers
    HandleMQ --> ResumeContainers
    ResumeContainers["Resume containers via containerd"] --> RefreshActivity["Refresh last-activity timestamp"]
    RefreshActivity --> UpdateRunning["Set state=Running & clear triggers"]
    UpdateRunning --> EmitResumed["Emit SandboxResumed event"]
    EmitResumed --> Exit
```
Copilot (AI), Nov 21, 2025:
The mermaid diagram has a logic issue: it shows two separate flows both starting with Start["Reconcile Sandbox"], which creates duplicate starting nodes. The pause and resume flows should be part of a single flowchart with conditional branches, or they should be clearly separated as different scenarios. Currently, both CheckIdle and WaitResume appear to be evaluated simultaneously from the same start node, which doesn't make logical sense.
> ###### MQ consumer and controller integration
>
> 1. A dedicated `ResumeConsumer` process (can be part of the agentd binary) connects to the NATS server.
> 2. It subscribes to the `agentcube.sandbox.resume` subject using a **durable subscription** (queue group or durable name) so that resumes are processed even if there are multiple replicas.
Copilot (AI), Nov 21, 2025:
Typo: "ensure sandboxes are processed" should be "ensures that resume messages are processed" or similar. The phrase "so that resumes are processed" is grammatically unclear - it should specify what "resumes" refers to (resume messages or resume operations).
Suggested change:

> 2. It subscribes to the `agentcube.sandbox.resume` subject using a **durable subscription** (queue group or durable name) so that resume messages are processed even if there are multiple replicas.
> - **Simplicity and footprint**: NATS is a single small binary and easy to run as a Kubernetes StatefulSet. It has fewer moving parts than Kafka or RabbitMQ.
> - **Cloud-native and Kubernetes-friendly**: There are mature Helm charts and operators; it integrates well with Kubernetes RBAC and multi-tenant setups.
> - **At-least-once delivery via durable subscriptions**: While we are not using JetStream, NATS durable subscriptions and manual acknowledgements give us at-least-once behavior for consumers that stay connected. Combined with idempotent handling on the controller side, this is sufficient for resume signals.
Copilot (AI), Nov 21, 2025:
The description mentions "durable subscriptions" for plain NATS core, but this could be misleading. In NATS terminology, "durable subscriptions" typically refer to queue groups with durable names, which provide at-most-once delivery per consumer group but don't persist messages when all consumers are offline. The document later acknowledges this limitation (line 260-263), but line 207 might give a false impression of stronger durability guarantees than plain NATS core actually provides. Consider clarifying that this refers to queue groups with durable consumer names, not JetStream's persistent message storage.
Suggested change:

> - **At-least-once delivery via queue groups with durable names (not JetStream; messages are not persisted if all consumers are offline)**: While we are not using JetStream, NATS queue groups with durable names and manual acknowledgements give us at-least-once behavior for consumers that stay connected. Combined with idempotent handling on the controller side, this is sufficient for resume signals.
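The proposal is not yet consistent about whether JetStream is in play (the summary says yes, this section says no). If JetStream is used, a durable queue consumer with manual acks might look like the sketch below; the subject is taken from the proposal, while the queue/durable names and `startResumeConsumer` helper are illustrative assumptions:

```go
package lifecycle

import (
	"log"

	"github.com/nats-io/nats.go"
)

// startResumeConsumer subscribes to resume requests with a durable queue
// consumer so any one replica handles each message, and redelivery is driven
// by explicit acknowledgements rather than consumer liveness.
func startResumeConsumer(url string, handle func(data []byte) error) error {
	nc, err := nats.Connect(url)
	if err != nil {
		return err
	}
	js, err := nc.JetStream()
	if err != nil {
		return err
	}
	_, err = js.QueueSubscribe("agentcube.sandbox.resume", "sandbox-resume-workers",
		func(msg *nats.Msg) {
			if err := handle(msg.Data); err != nil {
				// No ack: JetStream redelivers after the ack wait elapses.
				log.Printf("resume handling failed, leaving message unacked: %v", err)
				return
			}
			_ = msg.Ack()
		},
		nats.Durable("sandbox-resume"),
		nats.ManualAck(),
	)
	return err
}
```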
> - **Annotation path**: Users (or apiserver) set `sandbox.lifecycle.volcano.sh/resume-requested-at`; reconciler resumes and clears annotation.
> - **Message queue path**: Integrate with a concrete MQ to emit durable resume requests into the controller queue.
>
> 4. Add feature flags and configuration knobs: idle timeout, pause enablement, containerd socket path, per-namespace opt-out, fallback behavior (delete vs keep running).
> 5. Update apiserver and SDKs to surface paused state and expose a `/v1/sandboxes/{sandboxId}/pause` API.
Copilot (AI), Nov 21, 2025:
The API endpoint path mentioned here (/v1/sandboxes/{sandboxId}/pause) doesn't match the resume endpoint mentioned in line 173 (/v1/sandboxes/{sandboxId}/resume). If this proposal is introducing both pause and resume operations, both endpoints should be documented consistently. Additionally, the text only mentions exposing a pause API, but the context suggests resume should also be exposed. Consider documenting both endpoints or clarifying if only one is being added.
Suggested change:

> 5. Update apiserver and SDKs to surface paused state and expose `/v1/sandboxes/{sandboxId}/pause` and `/v1/sandboxes/{sandboxId}/resume` APIs.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.
> ##### Security and configuration
>
> - The controller needs sufficient permissions to talk to the containerd socket. There are two deployment options:
>   - Run the controller as a DaemonSet on each node with hostPath mount of `/run/containerd/containerd.sock`.
Copilot (AI), Nov 21, 2025:
The implementation mentions "Run the controller as a DaemonSet on each node with hostPath mount" (line 146), but the existing AgentdReconciler appears to be a cluster-scoped controller, not a DaemonSet. This architectural change has significant implications for deployment, RBAC, and high availability that aren't fully addressed. Consider adding a dedicated section on deployment architecture changes and migration strategy, or clarify if this refers to a separate component from the existing AgentdReconciler.
> - Document that environments requiring strong durability can swap in a persistent MQ (e.g. Kafka) behind the same interface.
> - **Back-pressure**: If the Kubernetes API becomes slow, the consumer can slow its processing rate and rely on NATS’ internal buffering and flow control.
> - **Security**: NATS can be configured with TLS and token-based or NKey authentication. The consumer uses a dedicated NATS account/creds with access only to the required subjects.
Copilot (AI), Nov 21, 2025:
The security section mentions "token-based or NKey authentication" for NATS (line 265), but doesn't address authentication/authorization for the annotation-based resume path. Any client with permission to update Sandbox annotations can trigger a resume, potentially bypassing intended access controls. Consider documenting the required RBAC permissions and whether additional authorization checks are needed beyond Kubernetes RBAC.
Suggested addition:

> - **Annotation-based resume path**: The annotation-based resume mechanism relies on Kubernetes RBAC for access control. Only trusted controllers or users should be granted permission to update the `Sandbox` resource's annotations (specifically, the `sandbox.lifecycle.volcano.sh/resume-requested-at` and related keys). Granting broader update access could allow unauthorized clients to trigger resume operations, potentially bypassing intended access controls. It is recommended to define fine-grained RBAC roles that restrict annotation updates to only those principals that are authorized to trigger resume actions. If additional authorization checks are required beyond Kubernetes RBAC, they should be implemented in the controller logic and documented here.
> ##### High-level flow
>
> - For each `Sandbox` CRD, there is a bound pod (the sandbox pod) created by the apiserver.
Copilot (AI), Nov 21, 2025:
The reference to "apiserver" on line 70 is inconsistent with the actual component that creates pods. Based on the codebase (pkg/apiserver/k8s_client.go), the apiserver creates Sandbox CRDs, but the Kubernetes controller (likely via the agent-sandbox operator) creates the actual pods from the Sandbox spec. Consider clarifying this to say "created by the Kubernetes controller" or "created from the Sandbox spec".
Suggested change:

> - For each `Sandbox` CRD, there is a bound pod (the sandbox pod) created by the Kubernetes controller from the Sandbox spec.
> - **Duplicates**: At-least-once delivery (via durable subscription and reconnect behavior) means the same resume intent may be delivered multiple times. We handle this by making annotations idempotent: setting `resume-requested-at` to the latest timestamp is safe even if resume is already in progress or completed.
> - **Ordering**: If multiple resume requests arrive, we only care about the most recent one. NATS preserves ordering per subject, but the controller logic does not rely on strict ordering.
> - **Lost messages**: For long consumer outages, plain NATS does not guarantee persistence. To mitigate this, we:
Copilot (AI), Nov 21, 2025:
The statement "For long consumer outages, plain NATS does not guarantee persistence" (line 260) contradicts the earlier claim (line 207) that "NATS durable subscriptions and manual acknowledgements give us at-least-once behavior." This inconsistency should be resolved. The document should be clear upfront that core NATS without JetStream does not persist messages when no consumers are connected, which is a significant limitation for a resume signaling mechanism where reliability is important.
> 5. Update apiserver and SDKs to surface paused state and expose a `/v1/sandboxes/{sandboxId}/pause` API.
> 6. Ship metrics/events/logging and document operational playbooks.
> 7. Add end-to-end tests that validate containerd pause/resume against a real or fake runtime.
>
> #### API / CRD Changes
>
> - New annotations:
>   - `sandbox.lifecycle.volcano.sh/state`: `Running|Paused|Resuming|Error`.
>   - `sandbox.lifecycle.volcano.sh/resume-requested-at`: RFC3339 timestamp.
>   - `sandbox.lifecycle.volcano.sh/last-paused-at`, `sandbox.lifecycle.volcano.sh/last-resumed-at` (optional audit).
> - Sandbox status additions:
>   - `status.conditions` entries for `Paused` and `ResumePending`.
> - Optional: extend apiserver REST/SDK to POST `/v1/sandboxes/{sandboxId}/resume`.
Copilot (AI), Nov 21, 2025:
The proposal mentions a /v1/sandboxes/{sandboxId}/pause API (line 161) but later only describes /v1/sandboxes/{sandboxId}/resume (line 173 and 193). The document should clarify whether both pause and resume APIs are being added, or if pause is only automatic via timeout. If both are added, document the use cases for manual pause. If only resume is exposed, remove the reference to the pause API endpoint.
> - The actual containerd task states.
> - Any outstanding resume triggers.
> - The controller then converges to the desired state (e.g. `Paused` or `Running`) without double-pausing or double-resuming.
Copilot (AI), Nov 21, 2025:
The design doesn't address potential race conditions when multiple reconciliation loops or resume triggers occur simultaneously. For example, if a pause operation is in progress and a resume annotation is added, or if two resume signals (annotation and MQ) arrive concurrently. Consider documenting concurrency control mechanisms such as using optimistic locking (resourceVersion checks), adding an operation-in-progress indicator, or ensuring the reconciler uses proper locking/queuing to serialize operations on the same sandbox.
Suggested addition:

> ##### Concurrency control and race condition prevention
>
> - To prevent race conditions when multiple reconciliation loops or resume triggers occur simultaneously (e.g., overlapping pause and resume requests, or concurrent triggers from annotations and MQ), the controller implements the following concurrency control mechanisms:
>   - **Optimistic locking:** All updates to the sandbox resource use Kubernetes `resourceVersion` checks to ensure that only the latest version is modified. If a concurrent update occurs, the reconcile is retried with the new state.
>   - **Operation-in-progress indicator:** The controller sets an explicit status field or annotation (e.g., `sandbox.lifecycle.volcano.sh/operation-in-progress`) to indicate when a pause or resume operation is underway. New operations are not started until the current one completes, ensuring serialization of state transitions.
>   - **Serialized reconciliation:** The reconciler ensures that only one operation (pause or resume) is performed at a time for a given sandbox, either by relying on the controller-runtime's per-resource queueing or by explicit locking if needed.
> - These mechanisms ensure that the controller converges to the correct desired state without double-pausing, double-resuming, or inconsistent state transitions, even under high event rates or controller restarts.
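For the optimistic-locking point, a minimal sketch of a conflict-retrying annotation update, assuming a controller-runtime client; `setLifecycleAnnotation` is an illustrative helper, not an existing one:

```go
package lifecycle

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setLifecycleAnnotation updates a sandbox annotation with optimistic locking:
// on a resourceVersion conflict the object is re-fetched and the update is
// retried, so concurrent pause/resume writers cannot silently overwrite each
// other. The key/value are whatever lifecycle marker the controller maintains.
func setLifecycleAnnotation(ctx context.Context, c client.Client, obj client.Object, key, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest version before mutating it.
		if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
			return err
		}
		anns := obj.GetAnnotations()
		if anns == nil {
			anns = map[string]string{}
		}
		anns[key] = value
		obj.SetAnnotations(anns)
		return c.Update(ctx, obj)
	})
}
```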
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
No description provided.